AITopics | language resource

ce9e92e3de2372a4b93353eb7f3dc0bd-Paper-Datasets_and_Benchmarks.pdf

Neural Information Processing Systems, Feb-19-2026, 12:00:31 GMT

computational linguistic, corpus, dataset, (11 more...)

Neural Information Processing Systems

Country:

Europe > Slovenia (0.04)
Europe > Middle East > Republic of Türkiye > Istanbul Province > Istanbul (0.04)
Europe > Germany > Saxony > Leipzig (0.04)
(29 more...)

Industry: Health & Medicine > Therapeutic Area (0.67)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Data Science (1.00)
Information Technology > Communications > Social Media (1.00)
(4 more...)

1e6057620ed314b0020b3a30284b0f83-Paper-Datasets_and_Benchmarks_Track.pdf

Neural Information Processing Systems, Feb-9-2026, 03:32:29 GMT

computational linguistic, dataset, glotcc, (15 more...)

Neural Information Processing Systems

Country:

Europe > Germany > Bavaria > Upper Bavaria > Munich (0.05)
Asia > Indonesia > Bali (0.04)
Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
(24 more...)

Genre: Research Report (0.67)

Industry:

Law (0.93)
Information Technology (0.67)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Communications > Social Media (0.93)
(4 more...)

BERnaT: Basque Encoders for Representing Natural Textual Diversity

Azurmendi, Ekhi, de Landa, Joseba Fernandez, Bengoetxea, Jaione, Heredia, Maite, Etxaniz, Julen, Zubillaga, Mikel, Soraluze, Ander, Soroa, Aitor

arXiv.org Artificial Intelligence, Dec-4-2025

Language models depend on massive text corpora that are often filtered for quality, a process that can unintentionally exclude non-standard linguistic varieties, reduce model robustness and reinforce representational biases. In this paper, we argue that language models should aim to capture the full spectrum of language variation (dialectal, historical, informal, etc.) rather than relying solely on standardized text. Focusing on Basque, a morphologically rich and low-resource language, we construct new corpora combining standard, social media, and historical sources, and pre-train the BERnaT family of encoder-only models in three configurations: standard, diverse, and combined. We further propose an evaluation framework that separates Natural Language Understanding (NLU) tasks into standard and diverse subsets to assess linguistic generalization. Results show that models trained on both standard and diverse data consistently outperform those trained on standard corpora, improving performance across all task types without compromising standard benchmark accuracy. These findings highlight the importance of linguistic diversity in building inclusive, generalizable language models.

artificial intelligence, computational linguistic, natural language, (14 more...)

arXiv.org Artificial Intelligence

2512.03903

Country:

North America > United States (0.46)
North America > Mexico (0.28)
Europe > Austria (0.28)
Asia > Middle East > UAE (0.14)

Genre: Research Report > New Finding (0.88)

Technology: Information Technology > Artificial Intelligence > Natural Language (1.00)

The PLLuM Instruction Corpus

Pęzik, Piotr, Żarnecki, Filip, Kaczyński, Konrad, Cichosz, Anna, Deckert, Zuzanna, Garnys, Monika, Grabarczyk, Izabela, Janowski, Wojciech, Karasińska, Sylwia, Kujawiak, Aleksandra, Misztela, Piotr, Szymańska, Maria, Walkusz, Karolina, Siek, Igor, Chrabąszcz, Maciej, Kołos, Anna, Karlińska, Agnieszka, Seweryn, Karolina, Krasnodębska, Aleksandra, Betscher, Paula, Cieślińska, Zofia, Kowol, Katarzyna, Wilczek, Artur, Trzciński, Maciej, Dziewulska, Katarzyna, Roszko, Roman, Bernaś, Tomasz, Vaičenonienė, Jurgita, Roszko, Danuta, Levchuk, Paweł, Kowalski, Paweł, Prawdzic-Jankowska, Irena, Kozłowski, Marek, Dadas, Sławomir, Poświata, Rafał, Wróblewska, Alina, Krasnowska-Kieraś, Katarzyna, Ogrodniczuk, Maciej, Rudolf, Michał, Rybak, Piotr, Saputa, Karolina, Wołoszyn, Joanna, Oleksy, Marcin, Koptyra, Bartłomiej, Ferdinan, Teddy, Woźniak, Stanisław, Piasecki, Maciej, Walkowiak, Paweł, Wojtasik, Konrad, Janz, Arkadiusz, Kazienko, Przemysław, Moska, Julia, Kocoń, Jan

arXiv.org Artificial Intelligence, Nov-24-2025

This paper describes the instruction dataset used to fine-tune a set of transformer-based large language models (LLMs) developed in the PLLuM (Polish Large Language Model) project. We present a functional typology of the organic, converted, and synthetic instructions used in PLLuM and share some observations about the implications of using human-authored versus synthetic instruction datasets in the linguistic adaptation of base LLMs. Additionally, we release the first representative subset of the PLLuM instruction corpus (PLLuMIC), which we believe to be useful in guiding and planning the development of similar datasets for other LLMs.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2511.17161

Country:

Asia (1.00)
Europe > Poland (0.67)

Genre: Research Report (1.00)

Industry:

Leisure & Entertainment (1.00)
Education (0.92)
Information Technology (0.68)
Media > Film (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

ParlaSpeech 3.0: Richly Annotated Spoken Parliamentary Corpora of Croatian, Czech, Polish, and Serbian

Ljubešić, Nikola, Rupnik, Peter, Porupski, Ivan, Pungeršek, Taja Kuzman

arXiv.org Artificial Intelligence, Nov-4-2025

ParlaSpeech is a collection of spoken parliamentary corpora currently spanning four Slavic languages (Croatian, Czech, Polish, and Serbian), altogether 6 thousand hours in size. The corpora were built automatically from the ParlaMint transcripts and their corresponding metadata, which were aligned to the speech recordings of each corresponding parliament. In this release of the dataset, each of the corpora is significantly enriched with various automatic annotation layers. The textual modality of all four corpora has been enriched with linguistic annotations and sentiment predictions. Similarly, their spoken modality has been automatically enriched with occurrences of filled pauses, the most frequent disfluency in typical speech. Two of the four languages have been additionally enriched with detailed word- and grapheme-level alignments and with automatic annotation of the position of primary stress in multisyllabic words. With these enrichments, the usefulness of the underlying corpora has been drastically increased for downstream research across multiple disciplines, which we showcase through an analysis of acoustic correlates of sentiment. All the corpora are made available for download in JSONL and TextGrid formats, as well as for search through a concordancer.

artificial intelligence, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2511.01619

Country: Europe > Austria (0.28)

Genre: Research Report > Experimental Study (0.94)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

BabyBabelLM: A Multilingual Benchmark of Developmentally Plausible Training Data

Jumelet, Jaap, Fourtassi, Abdellah, Haga, Akari, Bunzeck, Bastian, Shandilya, Bhargav, Galvan-Sosa, Diana, Haznitrama, Faiz Ghifari, Padovani, Francesca, Meyer, Francois, Hu, Hai, Etxaniz, Julen, Prévot, Laurent, He, Linyang, Grandury, María, Marcheva, Mila, Foroutan, Negar, Theodoropoulos, Nikitas, Sadeghi, Pouya, Song, Siyuan, Salhan, Suchir, Zhou, Susana, Paniv, Yurii, Zhang, Ziyin, Bisazza, Arianna, Warstadt, Alex, Choshen, Leshem

arXiv.org Artificial Intelligence, Oct-14-2025

We present BabyBabelLM, a multilingual collection of datasets modeling the language a person observes from birth until they acquire a native language. We curate developmentally plausible pretraining data aiming to cover the equivalent of 100M English words of content in each of 45 languages. We compile evaluation suites and train baseline models in each language. BabyBabelLM aims to facilitate multilingual pretraining and cognitive modeling.

computational linguistic, large language model, machine learning, (19 more...)

arXiv.org Artificial Intelligence

2510.10159

Country:

North America > United States (1.00)
Europe (1.00)
Asia (1.00)
Africa (0.67)

Genre: Research Report > New Finding (0.67)

Industry:

Media (1.00)
Government (0.93)
Education > Educational Setting > K-12 Education (0.68)
Leisure & Entertainment (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.47)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.45)

1e6057620ed314b0020b3a30284b0f83-Paper-Datasets_and_Benchmarks_Track.pdf

Neural Information Processing Systems, Oct-9-2025, 20:28:48 GMT

computational linguistic, dataset, glotcc, (16 more...)

Neural Information Processing Systems

Country:

Europe > Germany > Bavaria > Upper Bavaria > Munich (0.05)
Asia > Indonesia > Bali (0.04)
Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
(24 more...)

Genre: Research Report (0.67)

Industry:

Law (0.93)
Information Technology (0.93)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Communications > Social Media (0.93)
(4 more...)

Charting the Landscape of African NLP: Mapping Progress and Shaping the Road Ahead

Alabi, Jesujoba O., Hedderich, Michael A., Adelani, David Ifeoluwa, Klakow, Dietrich

arXiv.org Artificial Intelligence, Oct-3-2025

With over 2,000 languages and potentially millions of speakers, Africa represents one of the richest linguistic regions in the world. Yet, this diversity is scarcely reflected in state-of-the-art natural language processing (NLP) systems and large language models (LLMs), which predominantly support a narrow set of high-resource languages. This exclusion not only limits the reach and utility of modern NLP technologies but also risks widening the digital divide across linguistic communities. Nevertheless, NLP research on African languages is active and growing. In recent years, there has been a surge of interest in this area, driven by several factors, including the creation of multilingual language resources, the rise of community-led initiatives, and increased support through funding programs. In this survey, we analyze 884 research papers on NLP for African languages published over the past five years, offering a comprehensive overview of recent progress across core tasks. We identify key trends shaping the field and conclude by outlining promising directions to foster more inclusive and sustainable NLP research for African languages.

computational linguistic, large language model, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2505.21315

Country:

Asia (1.00)
Africa (1.00)
Europe > Spain (0.67)
North America > United States > Minnesota (0.27)

Genre: Overview (1.00)

Industry:

Health & Medicine (1.00)
Education (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(4 more...)

DeDisCo at the DISRPT 2025 Shared Task: A System for Discourse Relation Classification

Ju, Zhuoxuan, Wu, Jingni, Purushothama, Abhishek, Zeldes, Amir

arXiv.org Artificial Intelligence, Sep-23-2025

This paper presents DeDisCo, Georgetown University's entry in the DISRPT 2025 shared task on discourse relation classification. We test two approaches: an mT5-based encoder and a decoder-based approach using the openly available Qwen model. We also experiment with training on an augmented dataset for low-resource languages, using matched data translated automatically from English, and with additional linguistic features inspired by entries in previous editions of the shared task. Our system achieves a macro-accuracy score of 71.28, and we provide some interpretation and error analysis for our results.

computational linguistic, large language model, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2509.11498

Country:

Europe (1.00)
South America (0.67)
North America > United States > Maryland (0.28)
Asia > Japan > Honshū (0.28)

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

DIVERS-Bench: Evaluating Language Identification Across Domain Shifts and Code-Switching

Ojo, Jessica, Kamel, Zina, Adelani, David Ifeoluwa

arXiv.org Artificial Intelligence, Sep-23-2025

Language Identification (LID) is a core task in multilingual NLP, yet current systems often overfit to clean, monolingual data. This work introduces DIVERS-Bench, a comprehensive evaluation of state-of-the-art LID models across diverse domains, including speech transcripts, web text, social media texts, children's stories, and code-switched text. Our findings reveal that while models achieve high accuracy on curated datasets, performance degrades sharply on noisy and informal inputs. We also introduce DIVERS-CS, a diverse code-switching benchmark dataset spanning 10 language pairs, and show that existing models struggle to detect multiple languages within the same sentence. These results highlight the need for more robust and inclusive LID systems in real-world settings.

computational linguistic, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2509.17768

Country:

Europe (1.00)
North America > Canada (0.46)
Asia > Middle East (0.28)
(2 more...)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.93)
Information Technology > Communications > Social Media (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.46)
